The name of this course refers to THREE BIG THEMES: 1) Open Data, 2) Open Science, and 3) Data Science. These themes are summarized briefly as follows:
There are more and more open data sets available. Utilizing and sharing data is an essential skill for researchers in all fields. During this course we use open data sets from different sources and learn to prepare them for different analyses. You will explore, analyse and interpret data from real world applications.
Science strives to be open. Repeating or reproducing results is a common aim in any branch of science, but it is not always easy or simple. Sharing data alone is not enough for reproducibility: you also need openly available software tools and methods, and you need to share your code and results. You will learn these skills during this course, using state-of-the-art tools.
Data Science is the name for the data-driven world of statistics. Nowadays, finding or collecting data is not the problem; the challenges are in extracting knowledge and discovering the patterns behind the data. This requires skills in coding, programming and modelling, as well as in visualizing and analysing data. You will face all these topics in this course.
1 Tools and methods for open and reproducible research
R, RStudio, R Markdown, GitHub
2 Regression and model validation
3 Logistic regression
4 Clustering and classification
Discriminant analysis
K-means clustering
5 Dimensionality reduction techniques
Principal component analysis
Multiple correspondence analysis
6 Analysis of longitudinal data
Graphical displays and summary measures
Linear mixed effects models
After completing this course we will understand the principles and advantages of using open research tools with open data and understand the possibilities of reproducible research. We will know how to use R, RStudio, R Markdown, and GitHub for these tasks and also know how to learn more of these open software tools. We will also know how to apply certain statistical methods of data science, that is, data-driven statistics.
The link to my IODS GitHub repository is: https://github.com/tammisi/IODS-project.git
I’m looking forward to the course and to learning to use R and GitHub better in my daily work. I also expect to get a better understanding of performing basic statistical tests. I have been using bioinformatics tools, mostly R, in my PhD project, but I want to learn to be more efficient and organized in my work. I am also just starting with regression analysis, so this course comes at a very good time. So altogether I have big expectations for the course!
I am familiar with the basics of R, so there wasn’t anything very new for me in exercise set 1. But it is always nice to repeat things to refresh and enhance your memory. It was especially good to get some clarity on how to, for example, filter and modify data, since there isn’t always a good logic in the way I use all the tools and functions. So one goal for me in this course is to learn to be more efficient and systematic in my R scripts. I haven’t worked with dates much in R before, so there were also some useful tips on that topic that can come in handy in the future! I’m also looking forward to the exercises on statistical analysis, because I am a novice in that area but am just starting to need it in my work.
date()
## [1] "Sun Dec 10 15:11:49 2023"
# libraries needed in this exercise
library(tidyverse)
library(readr)
library(ggplot2)
library(GGally)
# libraries needed in the later exercises
library(ggpubr)    # ggarrange()
library(boot)      # cv.glm()
library(corrplot)  # corrplot()
# read in the learning2014 data
learning2014 <- read_csv("./data/learning2014.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## gender = col_character(),
## age = col_double(),
## attitude = col_double(),
## deep = col_double(),
## stra = col_double(),
## surf = col_double(),
## points = col_double()
## )
dim(learning2014)
## [1] 166 7
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
Answer:
The dataset has 166 rows and 7 columns. The data consists of results from a survey on students’ attitudes towards statistics, combined with their exam points. Besides gender and age, the variables include attitude, the student’s global attitude toward statistics, and the students’ answers to questions that have been grouped by their relatedness to deep (deep), surface (surf) and strategic (stra) learning; these were scaled back to the original 1–5 scale by taking the mean.
# get summary of variables
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# draw a scatter plot matrix of the variables in learning2014.
# [-1] excludes the first column (gender)
pairs(learning2014[-1])
# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
Answer:
In the summary we can see that the age of the participants ranges between 17 and 55, the mean being 25.5. Attitude, on a 1–5 scale, ranges from 1.4 to 5.0 with a mean of 3.1. Exam points range from 7 to 33 and the mean is 22.7.
From the plot matrix we can see that attitude correlates strongly with points (cor = 0.437), and that the answers related to surface and deep learning correlate with each other.
Answer:
After removing the non-significant variables, attitude remains statistically significant (p-value now 4.12e-09). The predicted change in exam points per one-unit change in attitude is 3.5255. The standard error is 0.57, which is fairly low, and the t-value indicates that the estimated coefficient is about 6.2 standard errors away from zero. The model now explains 18.6% (adjusted R-squared = 0.1856) of the variation in exam points. So the results didn’t change much when the non-significant variables were removed.
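The chunk that fitted this model is not shown above, but the numbers quoted here (coefficient 3.5255, adjusted R-squared 0.1856) are consistent with a simple regression of exam points on attitude; a minimal sketch of what `learning.lm` presumably contains:

```r
# refit the simplified model: exam points explained by attitude alone
# (assumes learning2014 was read in as in the earlier chunk)
learning.lm <- lm(points ~ attitude, data = learning2014)
summary(learning.lm)
```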
par(mfrow=c(2,2))
plot(learning.lm, which=c(1,2,5))
Answer:
The residuals vs. fitted plot shows that the points are quite evenly distributed and don’t show any clear pattern, so the model seems to fit the data well; no modifications, such as quadratic terms, are needed. The normal QQ plot, which plots the residuals against the theoretical quantiles of a normal distribution, follows a fairly straight diagonal line apart from some small deviations at the lower and upper ends, so the residuals seem approximately normally distributed. The residuals vs. leverage plot doesn’t show any strong outliers that would have a significant effect on the results. So according to the diagnostic plots, the model performs well on this data.
alc <- read.csv("./data/alc.csv")
dim(alc)
## [1] 370 36
head(alc)
## X school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 2 GP F 17 U GT3 T 1 1 at_home other
## 3 3 GP F 15 U LE3 T 1 1 at_home other
## 4 4 GP F 15 U GT3 T 4 2 health services
## 5 5 GP F 16 U GT3 T 3 3 other other
## 6 6 GP M 16 U LE3 T 4 3 services other
## reason guardian traveltime studytime schoolsup famsup activities nursery
## 1 course mother 2 2 yes no no yes
## 2 course father 1 2 no yes no no
## 3 other mother 1 2 yes no no yes
## 4 home mother 1 3 no yes yes yes
## 5 home father 1 2 no yes no yes
## 6 reputation mother 1 2 no yes yes yes
## higher internet romantic famrel freetime goout Dalc Walc health failures paid
## 1 yes no no 4 3 4 1 1 3 0 no
## 2 yes yes no 5 3 3 1 1 3 0 no
## 3 yes yes no 4 3 2 2 3 3 2 yes
## 4 yes yes yes 3 2 2 1 1 5 0 yes
## 5 yes no no 4 3 2 1 2 5 0 yes
## 6 yes yes no 5 4 2 1 2 5 0 yes
## absences G1 G2 G3 alc_use high_use
## 1 5 2 8 8 1.0 FALSE
## 2 3 7 8 8 1.0 FALSE
## 3 8 10 10 11 2.5 TRUE
## 4 1 14 14 14 1.0 FALSE
## 5 2 8 12 12 1.5 FALSE
## 6 8 14 14 14 1.5 FALSE
names(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
The data is combined from two identical questionnaires related to secondary school student alcohol consumption in Portugal. There are 370 students and 36 variables. There are variables defining student characteristics such as school, sex, age and so on. The numeric variables failures, absences, G1, G2 and G3 are rounded averages. alc_use is the average of the answers related to weekday and weekend alcohol consumption, and high_use is a logical variable (TRUE for alc_use > 2, FALSE otherwise).
I chose the variables sex, school, failures and famrel.
My null hypothesis (H0) is that these variables have no connection to alcohol consumption among students.
My H1 hypotheses are:
- sex is connected to high alcohol use (use is higher for sex = M),
- there might be differences between the schools,
- a higher number of past class failures and a poorer quality of family relationships are connected to higher alcohol consumption.
# bar plots of all variables
mydata <- select(alc, c(high_use, school, sex, famrel, failures))
gather(mydata) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
# barplots of non-numerical variables
g1 <- ggplot(alc, aes(x=high_use)) + geom_bar() + facet_wrap('sex') + ggtitle('high_use vs sex') + theme(plot.title = element_text(hjust = 0.5))
g2 <- ggplot(alc, aes(x=high_use)) + geom_bar() + facet_wrap('school') + ggtitle('high_use vs school') + theme(plot.title = element_text(hjust = 0.5))
# boxplots of numerical variables
g3 <- ggplot(alc, aes(x=high_use, y=famrel)) + geom_boxplot() + ggtitle('high_use vs famrel') + theme(plot.title = element_text(hjust = 0.5))
ggarrange(g1,g2,g3)
#g4 <- ggplot(alc, aes(x=high_use, y=failures)) + geom_boxplot() + ggtitle('high_use vs failures') + theme(plot.title = element_text(hjust = 0.5))
#g4
# for some reason the ggplot version doesn't work for failures, so use a base-R boxplot instead
# (the formula syntax groups failures by high_use)
boxplot(failures ~ high_use, data = alc, xlab = 'high use', ylab = 'failures', main = 'high_use vs. failures')
From the first set of bar plots we can see that most students have zero past failures and only very few have 1–3. Most of the students also have good family relations. High alcohol use is less common than low alcohol use. There are a lot more students from school ‘GP’ than ‘MS’, but the ratio between female and male students is fairly even.
The bar plot of high_use vs. sex indicates that a larger proportion of male students are high users of alcohol, so the hypothesis could be true.
The bar plot of high_use vs. school indicates that in both schools the proportion of high users is lower than that of non-high users. Even though the difference is bigger in GP, it might not be statistically significant.
The box plot of high_use vs. famrel indicates that the median quality of family relations is lower among high users. The hypothesis could therefore be true.
The box plot of high_use vs. failures indicates that both high and low users typically have very few past failures. Although there are some individuals with higher numbers of failures among the high users, I don’t think this difference is going to be statistically significant, so the hypothesis might not hold.
# fit model
alc.glm <- glm(high_use ~ sex + school + failures + famrel, data = alc, family = "binomial")
summary(alc.glm)
##
## Call:
## glm(formula = high_use ~ sex + school + failures + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7345 -0.8246 -0.6257 1.1687 1.9935
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2982 0.5170 -0.577 0.564098
## sexM 0.9357 0.2415 3.874 0.000107 ***
## schoolMS 0.3071 0.3725 0.824 0.409704
## failures 0.6162 0.2018 3.054 0.002260 **
## famrel -0.3083 0.1276 -2.415 0.015718 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 418.96 on 365 degrees of freedom
## AIC: 428.96
##
## Number of Fisher Scoring iterations: 4
confint(alc.glm)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -1.3226184 0.71249519
## sexM 0.4671197 1.41565633
## schoolMS -0.4438613 1.02622019
## failures 0.2278800 1.02645318
## famrel -0.5609504 -0.05862812
coef(alc.glm) %>% exp
## (Intercept) sexM schoolMS failures famrel
## 0.7421802 2.5490315 1.3594796 1.8518843 0.7346998
School does not seem to be a significant variable explaining alcohol consumption (p-value 0.410). However, sex is a highly significant variable (p-value 0.000107): the odds of a student being a high user are about 2.5 times higher (exp(0.936) = 2.55) for males compared with females, all other variables held constant. Failures is also significant (p-value 0.00226): a one-unit increase in the number of failures increases the log odds of high use by 0.62 (95% CI 0.23–1.03) units. Famrel is also somewhat significant (p-value 0.0157): every one-unit increase in the quality of family relations decreases the log odds of high use by 0.31 (95% CI 0.06–0.56) units. So the hypotheses for sex, failures and famrel were correct.
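The coefficients and confidence intervals above can also be combined on the odds scale in a single table; a small sketch, assuming `alc.glm` is the model fitted above:

```r
# exponentiate the coefficients and their profile-likelihood CIs to get
# odds ratios with 95% confidence intervals in one table
OR_table <- exp(cbind(OR = coef(alc.glm), confint(alc.glm)))
round(OR_table, 3)
```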
# retrain model with statistically significant variables
alc.glm <- glm(high_use ~ sex + failures + famrel, data=alc, family='binomial')
summary(alc.glm)
##
## Call:
## glm(formula = high_use ~ sex + failures + famrel, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7465 -0.8333 -0.6359 1.1559 1.9814
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2321 0.5109 -0.454 0.649564
## sexM 0.9322 0.2413 3.863 0.000112 ***
## failures 0.6144 0.2021 3.040 0.002366 **
## famrel -0.3159 0.1276 -2.476 0.013273 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 419.63 on 366 degrees of freedom
## AIC: 427.63
##
## Number of Fisher Scoring iterations: 4
# make predictions
probabilities <- predict(alc.glm, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# create logical vector indicating predictions
alc <- mutate(alc, prediction = probabilities > 0.5)
# table the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65945946 0.04054054 0.70000000
## TRUE 0.24594595 0.05405405 0.30000000
## Sum 0.90540541 0.09459459 1.00000000
# plot 'high_use' versus 'probability'
ggplot(alc, aes(x = high_use, y = probability)) + geom_point()
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2864865
The training error of the model is 0.286, i.e. 28.6% of the classifications are wrong. This is not ideal, and only slightly better than the trivial strategy of always predicting ‘not a high user’, which would be wrong for the 30% of students who are high users.
# K-fold cross-validation (K = nrow(alc), i.e. leave-one-out)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = alc.glm, K = nrow(alc))
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2891892
The test set error from leave-one-out cross-validation (0.289) is, as expected, slightly higher than the training error (0.286).
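Setting K = nrow(alc) makes cv.glm() perform leave-one-out cross-validation, which is deterministic but slow. A cheaper 10-fold version, as a sketch (the folds are drawn at random, so results vary slightly between runs unless a seed is set; the seed value here is arbitrary):

```r
library(boot)
set.seed(2023)  # arbitrary seed, only to make the random folds reproducible
cv10 <- cv.glm(data = alc, cost = loss_func, glmfit = alc.glm, K = 10)
cv10$delta[1]   # average share of wrong predictions over the 10 folds
```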
# function to calculate train and test error
alc_function <- function(model, Data) {
probabilities <- predict(model, type = "response")
Data <- mutate(Data, probability = probabilities)
Data <- mutate(Data, prediction = probabilities > 0.5)
print(table(high_use = Data$high_use, prediction = Data$prediction) %>% prop.table() %>% addmargins())
print(paste0("train error: ", loss_func(class = Data$high_use, prob = Data$probability)))
cv <- cv.glm(data = Data, cost = loss_func, glmfit = model, K = nrow(Data))
print(paste0("cv$delta: ", cv$delta[1]))
}
# start with a model with high number of variables as predictors
alc.1 <- select(alc, c(2:4, 6:25, 28:31,36))
alc.glm1 <- glm(high_use ~., data = alc.1, family = 'binomial')
summary(alc.glm1)
##
## Call:
## glm(formula = high_use ~ ., family = "binomial", data = alc.1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8095 -0.6842 -0.3740 0.5368 2.8989
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.25727 3.03561 -1.732 0.083297 .
## schoolMS -0.32057 0.52486 -0.611 0.541344
## sexM 0.88711 0.32771 2.707 0.006789 **
## age 0.10206 0.14998 0.680 0.496214
## famsizeLE3 0.45316 0.32350 1.401 0.161272
## PstatusT -0.34942 0.52415 -0.667 0.505005
## Medu 0.06630 0.21987 0.302 0.763006
## Fedu 0.14715 0.19563 0.752 0.451929
## Mjobhealth -1.14968 0.75199 -1.529 0.126303
## Mjobother -0.90964 0.49250 -1.847 0.064747 .
## Mjobservices -0.94817 0.55923 -1.695 0.089983 .
## Mjobteacher -0.54168 0.68617 -0.789 0.429865
## Fjobhealth 0.33164 1.19631 0.277 0.781611
## Fjobother 0.87998 0.86972 1.012 0.311635
## Fjobservices 1.31772 0.88582 1.488 0.136863
## Fjobteacher -0.23772 1.03541 -0.230 0.818407
## reasonhome 0.46813 0.38596 1.213 0.225173
## reasonother 1.58250 0.54173 2.921 0.003487 **
## reasonreputation 0.05031 0.41071 0.122 0.902511
## guardianmother -0.69328 0.36620 -1.893 0.058336 .
## guardianother -0.30531 0.78460 -0.389 0.697180
## traveltime 0.41657 0.22354 1.864 0.062385 .
## studytime -0.26555 0.20267 -1.310 0.190112
## schoolsupyes 0.09773 0.46184 0.212 0.832420
## famsupyes -0.03823 0.32434 -0.118 0.906162
## activitiesyes -0.50404 0.30982 -1.627 0.103762
## nurseryyes -0.53578 0.36578 -1.465 0.142979
## higheryes -0.17048 0.71054 -0.240 0.810380
## internetyes -0.04812 0.44144 -0.109 0.913193
## romanticyes -0.56868 0.33272 -1.709 0.087415 .
## famrel -0.57506 0.16594 -3.465 0.000529 ***
## freetime 0.29178 0.16939 1.723 0.084971 .
## goout 0.83792 0.15217 5.507 3.66e-08 ***
## health 0.19799 0.11309 1.751 0.080001 .
## failures 0.27196 0.27381 0.993 0.320599
## paidyes 0.51832 0.31850 1.627 0.103659
## absences 0.09152 0.02628 3.483 0.000496 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 315.42 on 333 degrees of freedom
## AIC: 389.42
##
## Number of Fisher Scoring iterations: 5
# apply function to get train and test error
alc_function(alc.glm1, alc.1)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64324324 0.05675676 0.70000000
## TRUE 0.13513514 0.16486486 0.30000000
## Sum 0.77837838 0.22162162 1.00000000
## [1] "train error: 0.191891891891892"
## [1] "cv$delta: 0.243243243243243"
# decrease the amount of variables (remove some of the insignificant variables)
alc.2 <- select(alc, c('sex', 'age', 'Medu', 'Fedu', 'Mjob', 'reason', 'guardian', 'traveltime', 'higher','famrel', 'goout', 'health', 'absences', 'high_use'))
# train model
alc.glm2 <- glm(high_use ~., data = alc.2, family = 'binomial')
summary(alc.glm2)
##
## Call:
## glm(formula = high_use ~ ., family = "binomial", data = alc.2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7549 -0.7323 -0.4254 0.6112 2.7109
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.147567 2.434744 -1.293 0.196090
## sexM 0.985572 0.291620 3.380 0.000726 ***
## age 0.020333 0.125677 0.162 0.871474
## Medu -0.080218 0.204250 -0.393 0.694509
## Fedu 0.006313 0.171372 0.037 0.970613
## Mjobhealth -0.643558 0.671539 -0.958 0.337895
## Mjobother -0.619779 0.436160 -1.421 0.155320
## Mjobservices -0.472012 0.490653 -0.962 0.336046
## Mjobteacher -0.176883 0.605425 -0.292 0.770161
## reasonhome 0.379839 0.356512 1.065 0.286681
## reasonother 1.281219 0.484228 2.646 0.008147 **
## reasonreputation -0.128246 0.373793 -0.343 0.731528
## guardianmother -0.608783 0.335136 -1.817 0.069290 .
## guardianother 0.226093 0.701806 0.322 0.747332
## traveltime 0.368261 0.199638 1.845 0.065089 .
## higheryes -0.167295 0.640623 -0.261 0.793981
## famrel -0.507684 0.152603 -3.327 0.000878 ***
## goout 0.858391 0.137201 6.256 3.94e-10 ***
## health 0.171406 0.104502 1.640 0.100960
## absences 0.090320 0.024226 3.728 0.000193 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 341.50 on 350 degrees of freedom
## AIC: 381.5
##
## Number of Fisher Scoring iterations: 5
# repeat calculations
alc_function(alc.glm2, alc.2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64054054 0.05945946 0.70000000
## TRUE 0.14594595 0.15405405 0.30000000
## Sum 0.78648649 0.21351351 1.00000000
## [1] "train error: 0.205405405405405"
## [1] "cv$delta: 0.245945945945946"
# remove all unsignificant variables
alc.3 <- select(alc, c('sex', 'reason', 'famrel', 'goout', 'absences', 'high_use'))
# train model
alc.glm3 <- glm(high_use ~., data = alc.3, family = 'binomial')
summary(alc.glm3)
##
## Call:
## glm(formula = high_use ~ ., family = "binomial", data = alc.3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8009 -0.7643 -0.4668 0.6793 2.7417
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.78746 0.70260 -3.967 7.27e-05 ***
## sexM 1.05224 0.26954 3.904 9.47e-05 ***
## reasonhome 0.21453 0.33099 0.648 0.516881
## reasonother 1.04957 0.46137 2.275 0.022912 *
## reasonreputation -0.35415 0.35218 -1.006 0.314607
## famrel -0.43597 0.14422 -3.023 0.002503 **
## goout 0.79325 0.12775 6.210 5.32e-10 ***
## absences 0.08343 0.02291 3.641 0.000272 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 356.63 on 362 degrees of freedom
## AIC: 372.63
##
## Number of Fisher Scoring iterations: 5
# repeat calculations
alc_function(alc.glm3, alc.3)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65945946 0.04054054 0.70000000
## TRUE 0.15405405 0.14594595 0.30000000
## Sum 0.81351351 0.18648649 1.00000000
## [1] "train error: 0.194594594594595"
## [1] "cv$delta: 0.194594594594595"
The test error decreases when only significant predictors are kept in the model, while the training error stays around 0.19–0.20 throughout.
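Instead of pruning predictors by hand, the selection could also be automated; a sketch using backward stepwise elimination by AIC from the MASS package (note that loading MASS masks dplyr’s select(), and that AIC is a different criterion than the p-values used above, so it may keep a slightly different set of variables):

```r
library(MASS)
# backward stepwise elimination by AIC, starting from the full model alc.glm1
alc.step <- stepAIC(alc.glm1, direction = "backward", trace = FALSE)
formula(alc.step)  # the predictors that survive the elimination
```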
library(MASS)  # MASS provides the Boston data and lda()
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data set contains housing values in the suburbs of Boston. There are in total 506 observations and 14 variables related to housing properties, such as whether the tract bounds the Charles River (chas), the average number of rooms (rm) and the median value of owner-occupied homes (medv). The variables are either numeric or integer.
# summarize variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# calculate the correlation matrix and round it to see relationships between the variables
cor_matrix <- cor(Boston) %>% round(digits=2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the summary we can see, for example, that:
- The per capita crime rate by town ranges from 0.006 to 88.976, with a mean of 3.61.
- Most tracts don’t bound the Charles River (the mean of the dummy variable is 0.069).
- The average number of rooms is 6.3 (range 3.6–8.8).
- age ranges from 2.9 to 100, with a mean of 68.6.
- There are on average 18.46 pupils per teacher (range 12.6–22).
- The median value of homes is 22.53 × $1000 = $22,530 (range $5,000–$50,000).
From the correlation plot we can see, for example, that:
- age (the proportion of units built before 1940) and dis (the weighted mean of distances to five Boston employment centres) have a strong negative correlation, so older housing is concentrated closer to the employment centres.
- nox (nitrogen oxides concentration) and dis have a strong negative correlation, so NO concentrations are higher close to the centres.
- nox and age have a strong positive correlation, which makes sense if the older, more central areas also have the higher NO concentrations.
- rad (accessibility to radial highways) and tax (full-value property-tax rate) have a strong positive correlation, so areas with highway access tend to have a higher property-tax rate.
- rm and medv have a strong positive correlation, so houses with more rooms are more valuable.
- crim correlates positively with rad and tax, so there is more crime in areas with highway access and higher property-tax rates.
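The pairs picked out by eye above can also be listed programmatically; a quick sketch that extracts every variable pair with |r| ≥ 0.7 from the correlation matrix (the 0.7 cutoff is an arbitrary choice):

```r
library(MASS)                           # for the Boston data
cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA    # keep each pair only once
strong <- which(abs(cm) >= 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           r    = round(cm[strong], 2))
```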
Standardize the dataset and print out summaries of the scaled data. How did the variables change?
# center and standardize variables
boston_scaled <- Boston %>% scale()
# summary of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- boston_scaled %>% as.data.frame()
The ranges of the variables are now clearly more even: after scaling, each variable has mean 0 and standard deviation 1, so for example tax (originally 187–711) or black (1000(Bk − 0.63)², where Bk is the proportion of blacks by town; originally 0.32–396.9) won’t dominate the LDA analysis.
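What scale() does column by column is subtract the mean and divide by the standard deviation; a tiny self-contained check with made-up numbers:

```r
x <- c(187, 279, 330, 711)        # made-up, tax-like values
manual <- (x - mean(x)) / sd(x)   # standardization by hand
auto   <- as.vector(scale(x))     # what scale() does by default
all.equal(manual, auto)           # the two agree
```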
Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset.
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
values <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = values)
# look at the table of the new factor crime
crime %>% table()
## .
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows (setting a seed makes the split reproducible)
set.seed(123)
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
# fit linear discriminant analysis with crime as the target and other variables as predictors
lda.fit <- lda(crime ~., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2376238 0.2574257 0.2549505
##
## Group means:
## zn indus chas nox rm age
## low 0.99223328 -0.9471046 -0.11640431 -0.8990619 0.47289851 -0.9012780
## med_low -0.07611098 -0.3325278 0.01475116 -0.5836638 -0.05946371 -0.3884817
## med_high -0.39488943 0.2250319 0.18195173 0.3545445 0.17590858 0.4373419
## high -0.48724019 1.0170891 -0.08120770 1.1057366 -0.32541776 0.7999376
## dis rad tax ptratio black lstat
## low 0.8891476 -0.6964601 -0.7526592 -0.5008193 0.37218460 -0.76840420
## med_low 0.3660686 -0.5607666 -0.5029658 -0.1117774 0.31762385 -0.22756314
## med_high -0.3579883 -0.4120549 -0.2936276 -0.2415032 0.09471852 -0.03084622
## high -0.8403438 1.6384176 1.5142626 0.7811136 -0.71919052 0.80005577
## medv
## low 0.56570386
## med_low 0.06087808
## med_high 0.21985196
## high -0.69817149
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09648342 0.67635488 -0.96128159
## indus 0.02676274 -0.43328652 0.28265826
## chas -0.07646340 -0.06446963 0.16425348
## nox 0.42781113 -0.62172383 -1.37078461
## rm -0.09885229 -0.10387057 -0.16054784
## age 0.20875932 -0.38392337 -0.22581351
## dis -0.05780905 -0.27582364 0.12252252
## rad 3.10090134 0.93591572 0.03637074
## tax 0.03429995 0.10037886 0.51755273
## ptratio 0.12672324 -0.04553954 -0.28344697
## black -0.13962095 0.04241061 0.10871084
## lstat 0.27779686 -0.22169118 0.35381690
## medv 0.20904882 -0.41194373 -0.20200822
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9468 0.0422 0.0110
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 9 3 0
## med_low 2 22 6 0
## med_high 1 7 13 1
## high 0 0 0 24
In this case, LDA is very good at predicting the high crime rates:
- all 24/24 “high” observations are correctly classified as high
Prediction accuracy for the other classes is not as good:
- 14/26 of “low” are correctly classified as low, 9/26 as med_low, 3/26 as med_high and none as high
- 22/30 of “med_low” are correctly classified as med_low, 6/30 as med_high and 2/30 as low
- 13/22 of “med_high” are correctly classified as med_high, 7/22 as med_low, 1/22 as low and 1/22 as high
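As a quick sanity check, the overall prediction accuracy can be computed as the share of observations on the diagonal of the cross-tabulation. A small sketch, reusing the `correct_classes` and `lda.pred` objects from above:

```r
# overall accuracy = correctly classified / all test observations
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
# with the table above this is (14 + 22 + 13 + 24) / 102, i.e. roughly 0.72
```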
Reload the Boston dataset and standardize the dataset. Calculate the distances between the observations.
data("Boston")
# scale and convert into data.frame
boston_scaled <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Run the k-means algorithm on the dataset. Investigate what the optimal number of clusters is and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
# k-means clustering on the scaled data with 1 to 5 clusters, plotting each result
for (i in 1:5) {
km <- kmeans(boston_scaled, centers = i)
pairs(boston_scaled[, c("rm", "age", "dis", "crim")], col = km$cluster)
}
K = 4 and K = 5 seem to be too many, because the clusters overlap; 2 or 3 clusters seem more suitable.
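A more systematic way to investigate the optimal number of clusters is to look at how the total within-cluster sum of squares (TWCSS) drops as k grows, and pick the k where the drop levels off. A sketch (k-means starts from random centers, so a seed is set for reproducibility):

```r
set.seed(123)  # k-means is random; fix the seed for reproducibility
k_max <- 10
# total within-cluster sum of squares for k = 1..k_max
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled, centers = k)$tot.withinss)
plot(1:k_max, twcss, type = "b", xlab = "number of clusters k", ylab = "TWCSS")
```

The "elbow" of this curve suggests a suitable k.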
Move the country names to rownames. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
human <- read.csv('./data/human.csv', header = T)
dim(human)
## [1] 155 9
human <- column_to_rownames(human, "Country")
head(human)
## Sec_edu_ratio LFPR_ratio Edu_years Life_exp GNI Mat_mort
## Norway 1.0072389 0.8908297 17.5 81.6 64992 4
## Australia 0.9968288 0.8189415 20.2 82.4 42261 6
## Switzerland 0.9834369 0.8251001 15.8 83.0 56431 6
## Denmark 0.9886128 0.8840361 18.7 80.2 44025 5
## Netherlands 0.9690608 0.8286119 17.9 81.6 45435 6
## Germany 0.9927835 0.8072289 16.5 80.9 43919 7
## Adol_birth_rate Rep_parl
## Norway 7.8 39.6
## Australia 12.1 30.5
## Switzerland 1.9 28.5
## Denmark 5.1 38.0
## Netherlands 6.2 36.9
## Germany 3.8 36.9
dim(human)
## [1] 155 8
ggpairs(human, progress = FALSE)
cor_matrix <- cor(human) %>% round(digits=2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the ggpairs plot we can see that many variables are not normally distributed, such as GNI and Mat_mort. Many of the variables are strongly correlated. From the correlation plot we can confirm that many variables have strong positive and negative correlations, but that the variables LFPR_ratio and Rep_parl don’t seem to correlate as much with the other variables.
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate on the x-axis, PC2 coordinate on the y-axis), along with arrows representing the original variables.
# perform PCA on non-standardized data
pca_human <- prcomp(human)
summary(pca_human) # proportion of variance and cumulative proportion captured by the components
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw biplot
biplot(pca_human, choices = 1:2)
Since GNI has a much wider range of values than the other variables and the data has not been standardized, GNI dominates the analysis: practically all of the variability in the data is captured by the first principal component, essentially along the GNI axis alone.
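This can be confirmed directly from the variances of the raw variables (a quick sketch using the `human` data frame from above):

```r
# variances of the unscaled variables: GNI is orders of magnitude larger than the rest,
# so it dominates an unscaled PCA
sort(sapply(human, var), decreasing = TRUE)
```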
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to.
# standardize data
human_std <- scale(human)
# perform PCA on standardized data
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# store the summary of pca_human_std
s <- summary(pca_human_std)
# rounded percentages of variance captured by each PC
pca_pr <- round(100 * s$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot, using the first value of `pc_lab` as the x-axis label and the second as the y-axis label
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The biplot looks different: after scaling the data, the variables contribute to the analysis on an equal footing, so the biplot is much more balanced. The first principal component now explains about 54% of the variation.
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.
The biplot shows how strongly each variable influences a principal component. We can see that PC1 is influenced by factors such as expected years of education and life expectancy, which are positively correlated with each other, as well as Mat_mort, which is negatively correlated with the other two. PC2 is influenced by LFPR_ratio and Rep_parl, which are also positively correlated with each other.
Load the tea dataset and convert its character variables to factors. Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# the character variables were already converted to factors by stringsAsFactors = TRUE above,
# so there is nothing left to convert here
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
# make a smaller data set of 'tea'
keep_columns <- c("breakfast", "evening", "work", "Tea", "sugar", "sex", "Sport", "relaxing", "effect.on.health")
# select the 'keep_columns' to create a new dataset
# (all_of() avoids the tidyselect warning about using an external vector in selections)
tea_time <- dplyr::select(tea, all_of(keep_columns))
# ggplot
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar()
Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.162 0.145 0.134 0.132 0.119 0.102 0.091
## % of var. 14.570 13.059 12.047 11.859 10.720 9.215 8.149
## Cumulative % of var. 14.570 27.629 39.676 51.534 62.254 71.470 79.619
## Dim.8 Dim.9 Dim.10
## Variance 0.083 0.078 0.066
## % of var. 7.453 7.008 5.919
## Cumulative % of var. 87.072 94.081 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.007 0.000 0.000 | 0.199 0.091 0.035 | 0.407 0.412
## 2 | 0.500 0.515 0.242 | -0.187 0.080 0.034 | 0.595 0.880
## 3 | -0.303 0.190 0.092 | -0.444 0.452 0.196 | -0.082 0.017
## 4 | -0.230 0.109 0.065 | 0.383 0.337 0.181 | -0.314 0.246
## 5 | -0.503 0.522 0.288 | -0.009 0.000 0.000 | 0.387 0.373
## 6 | -0.115 0.027 0.019 | 0.192 0.085 0.052 | 0.071 0.013
## 7 | -0.154 0.049 0.033 | -0.086 0.017 0.010 | 0.324 0.262
## 8 | -0.160 0.053 0.024 | 0.142 0.046 0.019 | 0.593 0.877
## 9 | -0.154 0.049 0.033 | -0.086 0.017 0.010 | 0.324 0.262
## 10 | -0.179 0.066 0.026 | 0.205 0.096 0.034 | 0.849 1.796
## cos2
## 1 0.146 |
## 2 0.342 |
## 3 0.007 |
## 4 0.122 |
## 5 0.170 |
## 6 0.007 |
## 7 0.145 |
## 8 0.334 |
## 9 0.145 |
## 10 0.578 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## breakfast | -0.074 0.182 0.005 -1.235 | -0.495 9.012 0.226
## Not.breakfast | 0.069 0.168 0.005 1.235 | 0.457 8.319 0.226
## evening | -0.830 16.240 0.360 -10.380 | 0.173 0.783 0.016
## Not.evening | 0.434 8.491 0.360 10.380 | -0.090 0.409 0.016
## Not.work | 0.019 0.018 0.001 0.522 | 0.442 10.625 0.478
## work | -0.047 0.044 0.001 -0.522 | -1.082 26.013 0.478
## black | 0.224 0.850 0.016 2.217 | 0.165 0.512 0.009
## Earl Grey | -0.227 2.280 0.093 -5.277 | -0.318 4.982 0.182
## green | 0.826 5.157 0.084 5.024 | 1.491 18.716 0.275
## No.sugar | 0.550 10.742 0.324 9.840 | -0.196 1.515 0.041
## v.test Dim.3 ctr cos2 v.test
## breakfast -8.226 | 0.434 7.490 0.174 7.203 |
## Not.breakfast 8.226 | -0.400 6.914 0.174 -7.203 |
## evening 2.158 | 0.136 0.528 0.010 1.701 |
## Not.evening -2.158 | -0.071 0.276 0.010 -1.701 |
## Not.work 11.961 | 0.110 0.717 0.030 2.983 |
## work -11.961 | -0.270 1.754 0.030 -2.983 |
## black 1.629 | 1.349 37.243 0.596 13.345 |
## Earl Grey -7.385 | -0.494 13.026 0.440 -11.469 |
## green 9.062 | -0.136 0.169 0.002 -0.826 |
## No.sugar -3.498 | 0.458 8.987 0.224 8.184 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## breakfast | 0.005 0.226 0.174 |
## evening | 0.360 0.016 0.010 |
## work | 0.001 0.478 0.030 |
## Tea | 0.121 0.316 0.608 |
## sugar | 0.324 0.041 0.224 |
## sex | 0.101 0.204 0.026 |
## Sport | 0.126 0.015 0.025 |
## relaxing | 0.379 0.002 0.091 |
## effect.on.health | 0.041 0.008 0.018 |
# visualize the model
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
plot(mca, invisible=c("var"), graph.type = "classic", habillage = "quali")
The distance between variable categories gives a measure of their similarity (or dissimilarity). For example, relaxing and evening seem to contribute most to Dim 1, whereas work and Tea contribute strongly to Dim 2.
From the individuals’ scatterplot we can see that there are no clearly separated groups of individuals; the scatterplot is quite homogeneous.
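As an alternative visualization, the factoextra package (assuming it is installed) provides ggplot2-based MCA plots, for example a combined biplot of individuals and variable categories:

```r
library(factoextra)
# biplot of individuals and variable categories; repel = TRUE avoids overlapping labels
fviz_mca_biplot(mca, repel = TRUE, ggtheme = theme_minimal())
```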
The RATS data has been converted into long format in meet_and_repeat.R, so I simply import the data:
RATSL <- read.table("./data/RATSL.txt", header = T)
# convert ID and Group into factors
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Draw ggplot with `Time` on the x-axis and `Weight` on the y-axis
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=20)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
Next, we inspect the tracking to see if rats that have higher weight values at the beginning tend to have higher values throughout the study.
The tracking phenomenon can be seen more clearly in a plot of the standardized values of each observation, i.e., the values obtained by subtracting the relevant occasion mean from the original observation and then dividing by the corresponding visit standard deviation:
\[standardised(x) = \frac{x - mean(x)}{ sd(x)}\]
# Group by WD and standardise the variable Weight
RATSL <- RATSL %>%
group_by(WD) %>%
mutate(stdweight = scale(Weight)) %>%
ungroup()
# Plot again with the standardised Weight
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=20)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized weight")
Let’s produce graphs showing the average (mean) profiles for each group, along with some indication of the variation of the observations at each time point, in this case the standard error of the mean:
\[se = \frac{sd(x)}{\sqrt{n}}\]
# Number of subjects (per group):
unique(RATSL$Group) # 3 groups
## [1] 1 2 3
## Levels: 1 2 3
length(which(RATSL$Group ==1))
## [1] 88
length(which(RATSL$Group ==2))
## [1] 44
length(which(RATSL$Group ==3))
## [1] 44
library(dplyr)
library(tidyr)
# Summary data with mean and standard error of Weight by Group and Time
# (group sizes differ: 8 rats in Group 1 and 4 in Groups 2 and 3, so use n() within each group)
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight) / sqrt(n()) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), linetype = 1, width = 0.3) +
theme(legend.position = c(0.9, 0.4)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Next, let’s check for possible outliers:
# Create a summary data by Group and ID with mean as the summary variable
RATSL8S <- RATSL %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days 1-64")
# Create a new data set by filtering out the outlier (row 12, in Group 2) and draw the plot again
RATSL8S1 <- RATSL8S[-12, ]
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days 1-64")
T for test and A for ANOVA: a baseline-adjusted analysis could be done by adding the Day 1 weight from the original wide-form data as a covariate. The code is left commented out here because the wide-form RATS data is not loaded in this report:
# Add the baseline from the original data as a new variable to the summary data
# RATSL8S2 <- RATSL8S %>%
# mutate(baseline = RATS$WD1)
# Fit the linear model with the mean as the response
# fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
# Compute the analysis of variance table for the fitted model with anova()
# anova(fit)
# create a random intercept model and a random intercept + random slope model
library(lme4)
RATS_ref <- lmer(Weight ~ Time + Group + (1 | ID), data = RATSL, REML = FALSE)
RATS_ref1 <- lmer(Weight ~ Time + Group + (Time | ID), data = RATSL, REML = FALSE)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
summary(RATS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Weight ~ Time + Group + (Time | ID)
## Data: RATSL
##
## AIC BIC logLik deviance df.resid
## 1194.2 1219.6 -589.1 1178.2 168
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2259 -0.4322 0.0555 0.5637 2.8825
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## ID (Intercept) 1139.2004 33.7520
## Time 0.1122 0.3349 -0.22
## Residual 19.7479 4.4439
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 246.46821 11.80888 20.871
## Time 0.58568 0.08548 6.852
## Group2 214.55803 20.16986 10.638
## Group3 258.91288 20.16986 12.837
##
## Correlation of Fixed Effects:
## (Intr) Time Group2
## Time -0.166
## Group2 -0.569 0.000
## Group3 -0.569 0.000 0.333
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0028726 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(RATS_ref1, RATS_ref)
## Data: RATSL
## Models:
## RATS_ref: Weight ~ Time + Group + (1 | ID)
## RATS_ref1: Weight ~ Time + Group + (Time | ID)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## RATS_ref 6 1333.2 1352.2 -660.58 1321.2
## RATS_ref1 8 1194.2 1219.6 -589.11 1178.2 142.94 2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BPRSL <- read.table("./data/BPRSL.txt", header = T)
# convert treatment and subject into factors
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
The bprs values seem to decrease in both treatment groups during treatment weeks 1-8, but there do not seem to be clear differences between the groups.
Let’s fit a multiple linear regression model to see if this is indeed correct.
# fit a multiple linear regression model with `bprs` as the response and
# `week` and `treatment` as explanatory variables
BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRSL_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
In the summary we can see that time (week) has a statistically significant (p-value < 2e-16) negative association with the bprs value, so during treatment weeks 1-8 the bprs value goes down. However, the type of treatment (treatment) is not statistically significant, so the treatment type does not appear to affect the decrease of the bprs value, as could already be seen in the plot above.
The lm model assumes independence of the repeated measures of bprs, and this assumption is highly unlikely to hold, since the measurements come from the same subjects. So we now move on to consider both more appropriate graphics and more appropriate models.
To begin the more formal analysis of the bprs data, we will first fit the random intercept model for the same two explanatory variables: week and treatment. Fitting a random intercept model allows the linear regression fit for each study subject to differ in intercept from other subjects.
# Create a random intercept model with subject as the random effect
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
We see here again that treatment has no clear effect on bprs. The t value for treatment2 is close to zero (0.532), so its p-value is large, whereas the t-statistic for week is far from zero (-10.896) with a correspondingly small p-value. This indicates that such a result is unlikely to arise by chance, and the null hypothesis of no association between time and bprs should be rejected. Note also the considerable between-subject variability captured by the random intercept (variance 47.41).
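Note that lmer() deliberately reports no p-values for the fixed effects. A rough two-sided p-value can be sketched from the t values with a normal approximation (reasonable here given the fairly large number of observations):

```r
# approximate two-sided p-values from the fixed-effect t values of BPRS_ref
t_vals <- coef(summary(BPRS_ref))[, "t value"]
2 * pnorm(-abs(t_vals))
```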
Now we can move on to fit the random intercept and random slope model to the BPRS data. Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope. This way it is possible to account for the individual differences in the bprs profiles, but also the effect of time.
# create a random intercept and random slope model with `week` and `subject` as the random effects
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# Compute the ANOVA table comparing the models BPRS_ref and BPRS_ref1
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BPRS_ref1 (the random intercept and random slope model) fits better than BPRS_ref (the random intercept model): the likelihood ratio test gives a chi-squared statistic of 7.2721 with a p-value of 0.02636. So the fit of BPRS_ref1, with both a random intercept and a random slope, is better than the fit of BPRS_ref with only a random intercept.
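The likelihood ratio test that anova() performs here can also be reproduced by hand from the deviances printed above (a sketch; both models were fitted with REML = FALSE, so the ML deviances are comparable):

```r
# chi-squared statistic: difference in ML deviances (2738.7 - 2731.4 ≈ 7.27)
chisq <- deviance(BPRS_ref) - deviance(BPRS_ref1)
# df = difference in the number of parameters (7 - 5 = 2)
pchisq(chisq, df = 2, lower.tail = FALSE)  # ≈ 0.026, matching the anova() output
```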
Finally, we can fit a random intercept and slope model that allows for a treatment group × time interaction, even though we saw above that treatment had no clear effect.
# Write the same model as in the previous exercise but add `week` * `treatment` interaction.
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week * treatment), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, the model with the week * treatment interaction is not significantly better than the model without it (chi-squared statistic 3.1712, p-value 0.07495), so adding the interaction was unnecessary and did not improve the model.